Abstract

Background: Protein-protein interactions can be seen as a hierarchical process occurring at three related levels: 
proteins bind by means of specific domains, which in turn form interfaces through patches of residues. Detailed 
knowledge about which domains and residues are involved in a given interaction has extensive applications to 
biology, including better understanding of the binding process and more efficient drug/enzyme design. Alas, most 
current interaction prediction methods do not identify which parts of a protein actually instantiate an interaction. 
Furthermore, they also fail to leverage the hierarchical nature of the problem, ignoring otherwise useful information 
available at the lower levels; when they do, they do not generate predictions that are guaranteed to be consistent 
between levels. 

Results: Inspired by earlier ideas of Yip et al. (BMC Bioinformatics 10:241, 2009), in the present paper we view the 
problem as a multi-level learning task, with one task per level (proteins, domains and residues), and propose a 
machine learning method that collectively infers the binding state of all object pairs. Our method is based on 
Semantic Based Regularization (SBR), a flexible and theoretically sound machine learning framework that uses First 
Order Logic constraints to tie the learning tasks together. We introduce a set of biologically motivated rules that 
enforce consistent predictions between the hierarchy levels. 

Conclusions: We study the empirical performance of our method using a standard validation procedure, and 
compare its performance against the only other existing multi-level prediction technique. We present results showing 
that our method substantially outperforms the competitor in several experimental settings, indicating that exploiting 
the hierarchical nature of the problem can lead to better predictions. In addition, our method is also guaranteed to 
produce interactions that are consistent with respect to the protein-domain-residue hierarchy. 



Background 

Physical interactions between proteins are the workhorse 
of cell life and development [1], and play an extremely 
important role both in the mechanisms of disease [2] 
and in the design of new drugs [3]. In recent years, 
there has been enormous interest in reverse engineer- 
ing the protein-protein interaction (PPI) networks of 
several species, particularly due to the availability of high- 
throughput experimental techniques, leading to an abun- 
dance of large databases on all aspects of PPIs [4]. 
Notwithstanding the increased availability of interaction
data, the natural question of whether two arbitrary 
proteins interact, and why, is still open. The growing 
literature on protein interaction prediction [4-6] is symp- 
tomatic of the gap separating the amount of available data 
and the effective size of the interaction network [7]. The 
present paper is a contribution towards filling this gap. 

Our work is based on the observation that physical 
interactions can be viewed at three levels of detail. At 
a higher level, two proteins interact to perform some 
function within a biological pathway (e.g. metabolism, 
signaling, regulation, etc) [8]. At a lower level, the same 
interaction occurs between a pair of specific domains 
appearing in the proteins; the types of the domains 
involved characterize the functional semantics of the 
interaction [9]. At the lowest level, the interaction is 
instantiated by the binding of a pair of protein inter- 
faces, patches of solvent accessible residues with compat- 
ible shapes and chemical properties [10]. The low-level 
features of the binding sites determine whether the inter- 
action is transient or permanent, whether two proteins 
compete for interaction with a third one, etc. Figure 1 illus- 
trates the multi-level mechanisms with an example taken 
from the PDB. 

Despite the significance of low-level details in elu- 
cidating the mechanics of protein-protein interactions, 
most of the current experimental data comes from high- 
throughput screening techniques, such as yeast two- 
hybrid (Y2H) assays [11]. These techniques do not provide 
information on domain- or residue-level interactions, 
which require solving the three-dimensional structure of 
each protein-protein complex, an expensive and time con- 
suming task addressed by X-Ray crystallography, NMR, or 
electron microscopy techniques [12]. As a consequence, 
protein-protein interaction data is under-characterized 
at the domain and residue levels: the current databases 
are relatively lacking when compared to the magnitude 
of the existing body of data about protein-level interac- 
tions [13]. At the time of writing, the PDB hosts 84,418 
structures, but merely 4,210 resolved complexes (according
to http://www.rcsb.org/pdb/statistics/holdings.do,
retrieved on 2013/06/20). The latter 
cover only a tiny fraction of the interactions stored in 
databases such as BioGRID and MIPS. 

From a purely biological perspective, predictions at dif- 
ferent levels have several important applications. The 
network topology and individual features of protein inter- 
actions are an essential component of a wide range of 
biological tasks: inferring protein function [14] and local- 
ization [15], reconstructing signal and metabolic pathways 
[16], discovering candidate targets for drug development 
[2]. Finer granularity predictions at the domain level 
make it possible to discover affinities between domain types that 
can be carried over to other proteins [17,18]; domain- 
domain networks have also been assessed as being typ- 
ically more reliable than their protein counterparts [13]. 



Finally, residue-level predictions, i.e., interface recogni- 
tion, enable the detailed study of the principles of protein 
interactions, and are crucial for tasks such as rational 
drug design [3], metabolic reconstruction and engineering 
[19], and identification of hot-spots [20] in the absence of 
structure information. 

Given the usefulness of knowing the details of protein- 
protein interactions at diverse levels of detail, and based 
on earlier ideas of Yip et al. [21], in this paper we address 
the problem of collectively predicting the binding state 
of all proteins, domains, and residues in a network. We 
call this task the multi-level protein interaction prediction 
problem (MLPIP for short). 

From a computational point of view, the most important 
feature of the multi-level prediction problem is its inher- 
ently relational nature. Proteins, domains and residues 
are organized in a hierarchy, which dictates constraints 
on the binding state of pairs of objects at the differ- 
ent levels, as follows. On the one hand, whenever two 
proteins are bound, at least two of their domains must 
also be bound, and, similarly, there must be residues 
in the two domains that form an interface. On the 
other hand, if no residues of the two proteins interact, 
neither do their domains, nor the proteins themselves. 
In other words, predictions at different levels must be 
consistent. 

In this paper we cast the multi-level prediction problem 
as a statistical-relational learning task, leveraging the lat- 
est developments in the field. Our prediction method is 
based on Semantic Based Regularization [22], an elegant 
semi-supervised prediction framework that combines
the effectiveness of kernel machines with the expressivity
of First Order Logic (FOL). The constraints described 
above are encoded as FOL rules, which are used to enforce 
consistent predictions at all levels of the interaction hier- 
archy. By computing multi-level predictions, our method 
can not only infer which protein pairs are likely to interact, 
but also provide details about how the interactions take 
place. Our empirical evaluation shows the effectiveness of 
this constraint-based approach in boosting predictive per- 
formance, achieving substantial improvements over both 




Figure 1 The protein-domain-residue hierarchy. Two bound proteins and their interacting domains and residues, captured in PDB complex 
4I0P. The proteins are a Killer cell lectin-like receptor (in violet) and its partner, a C-type lectin domain protein (in blue). (Left) Interaction as visible 
from the contact surface. (Center) The two C-type lectin domains instantiating the interaction. (Right) Effectively interacting residues in red. 
an unconstrained baseline and the only existing alternative
MLPIP method [21]. 

Problem definition 

PPI networks are most naturally formalized as graphs, 
where nodes represent proteins and edges represent inter- 
actions. Given a set of features describing the properties of 
the proteins in the network (e.g. primary structure, localization,
and tertiary structure when available), inferring
the PPI network topology amounts to determining 
those pairs of proteins that are likely to interact. This task 
is often cast as a pairwise classification problem, where a 
binary classifier takes as input a pair of proteins (or rather 
their feature-based representations) and predicts whether 
they interact or not. Standard binary classification meth- 
ods, such as Support Vector Machines [23], can be used to 
implement the pairwise classifier. In this setting, the inter- 
action depends only on the features of the two incident 
nodes, and is independent of all other nodes. Interactions 
between domains or residues can be predicted similarly. 

The most straightforward way to address the MLPIP 
problem is to cast the three interaction prediction prob- 
lems, for proteins, domains and residues respectively, as 
independent pairwise classification tasks. However, as 
previously discussed, these problems are clearly strongly 
related: two proteins interact via one or more domains, 
which in turn contain patches of residues that consti- 
tute the interaction surface. Ignoring these relationships 
can lead to heavily suboptimal, inconsistent predictions, 
where, e.g. two proteins are predicted to interact but 
none of their domains are predicted to be involved in this 
interaction. Making these relationships explicit and forc- 
ing predictors to satisfy consistency constraints is the key 
contribution of this work. In the machine learning com- 
munity, this kind of scenario characterized by multiple 
related prediction tasks is usually cast as a statistical- 
relational learning problem [24,25], where the goal is to 
collectively classify the state of all objects of interest, tak- 
ing into account the relations existing between them. The 
solution we adopt is grounded in this learning framework. 

Overview of the proposed method 

In this paper we propose solving the multi-level prediction 
problem adapting a state-of-the-art statistical-relational 
learning framework, namely Semantic Based Regulariza- 
tion (SBR) [22]. SBR ties multiple learning tasks, which 
are themselves addressed by kernel machines, using con- 
straints expressing First Order Logic knowledge. In the 
following we give an overview of the SBR framework, also 
pictured in Figure 2; see Methods for further details. 

Let X be a set of objects. In most scenarios, objects 
are typed, so that objects of the same type can be con- 
sidered as belonging to the same group. In our setting, 
object types are proteins, domains and residues, with 



corresponding sets Xp, Xd and Xr respectively. Predicates 
represent properties of objects or relationships between 
them. Depending on the scenario, some predicates are 
always known (called given predicates), while others are 
known only for a subset of the objects, and their value 
should be predicted when unknown (query or target
predicates). The parentpd(p, d) predicate, for instance, 
specifies that domain $d \in X_d$ is part of protein $p \in X_p$, i.e. 
the predicate is true for all (p, d) pairs for which d is a 
domain of p, and false otherwise. The value of this predicate
is known for all objects in our domain (note that there 
are indeed many proteins whose domains are unknown, 
but in this case there is no corresponding domain object 
in our data). The boundp(p, p') predicate specifies 
whether two proteins p and p' are interacting. This is one 
of the target predicates, whose truth value should be pre- 
dicted for novel protein-protein pairs. Similar predicates 
are defined for domain and residue level bindings. Target 
predicates are modelled as binary classifiers, i.e. functions 
trained to predict the truth value of the predicate. Rela- 
tionships between predicates can be introduced in order 
to enforce constraints known to hold in the domain. SBR 
makes it possible to exploit the full power of First Order Logic in 
doing this. As an example, the notion that two 
interacting proteins should have at least one interacting 
domain can be modelled as (see Methods for details on 
First Order Logic notation): 

$$\forall (p, p')\;\; \mathtt{boundp}(p, p') \Rightarrow \exists (d, d')\;\; \mathtt{boundd}(d, d') \wedge \mathtt{parentpd}(p, d) \wedge \mathtt{parentpd}(p', d')$$

Each binary classifier is implemented in the SBR frame- 
work as a kernel machine [26]. The key component of 
kernel machines is the kernel function, which measures 
the similarity between objects in terms of their repre- 
sentations. A protein, for instance, can be represented 
as the sequence of its residues, plus additional infor- 
mation as its subcellular localization and/or its phylo- 
genetic profile. Having the same subcellular localization, 
for instance, should increase the similarity between two 
proteins, as having a similar amino acid composition. 
Designing appropriate kernels is a crucial component of a 
successful predictor. A kernel machine is a function which 
predicts a certain property of an object x in terms of a 
weighted sum of similarities to other objects for which the 
property is known, i.e.: 

$$f(x) = \sum_i w_i K(x, x_i)$$

A kernel machine could for instance predict whether a 
protein is an enzyme or not (binary classification), in 
terms of weighted similarity to other proteins. Being similar
to an enzyme x_i will drive the prediction towards 
the positive (enzyme) class (positive weight w_i), while 



Saccà et al. BMC Bioinformatics 2014, 15:103 
http://www.biomedcentral.com/1471-2105/15/103 



Figure 2 Visualization of the proposed method. (a) Kernel preparation at the three levels. A kernel is 
derived for each input feature (Left); the resulting matrices are summed up to obtain a per-object kernel (Middle), which is transformed into a 
pairwise kernel using Eq. 1. Here Np (Nd, Nr) is the number of individual proteins (respectively domains, residues) in the level, while Npp (Ndd, Nrr) is 
the number of protein (respectively domain, residue) interactions in the dataset. (b) Predicates applied to a protein pair. Instantiation of all predicates (Table 1) over a pair of proteins p 
and p' and their parts. Circles represent proteins, domains and residues. Dotted lines indicate a parent-child relationship between objects, 
representing the parentpd and parentdr predicates. Solid lines link pairs of bound objects, i.e. objects for which the boundp, boundd or 
boundr predicates are true. (c) Visualization of the experimental pipeline. Given the pairwise kernels, the set of rules (Table 2), a set of example 
interactions, and a description of the protein-domain-residue hierarchy, SBR finds a prediction for the query predicates consistent with the rules. 






being similar to a non-enzyme x_j will drive the prediction 
towards the opposite class (negative weight w_j). 
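
As a minimal sketch of the weighted-sum prediction described above, the following Python snippet evaluates a kernel machine on a toy problem. The Gaussian kernel, the two-dimensional feature vectors and the weights are illustrative assumptions; they are not the kernels or the learned weights used in this work.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian similarity between two feature vectors (illustrative choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def kernel_machine(x, references, weights, kernel=rbf_kernel):
    """Weighted sum of similarities between x and the reference objects."""
    return sum(w * kernel(x, x_i) for w, x_i in zip(weights, references))

# Hypothetical toy data: one known enzyme (positive weight) and one
# known non-enzyme (negative weight).
references = [(0.0, 0.0), (3.0, 3.0)]
weights = [1.0, -1.0]

score_near_enzyme = kernel_machine((0.1, 0.0), references, weights)
score_near_other = kernel_machine((2.9, 3.0), references, weights)
```

In this toy setting the sign of the score plays the role of the predicted class, and its magnitude plays the role of the confidence.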

In the interaction prediction setting, target predicates 
actually predict properties of pairs of objects (proteins, 
domains or residues). We thus employ a pairwise kernel 
machine classifier to model the target predicate: 

$$f(x, x') = \sum_i w_i K\big((x, x'), (x_i, x'_i)\big)$$

Here the kernel function measures the similarity between 
two pairs of objects, so that, e.g. two proteins will be pre- 
dicted as interacting if they are similar to protein pairs 
which are known to interact, and dissimilar from pairs 
known to not interact. 

Given a kernel between objects K(x, x'), it is possible to 
construct a pairwise kernel by means of the following 
transformation [27]: 

$$K\big((x_i, x_j), (x_k, x_l)\big) = K(x_i, x_k) \cdot K(x_j, x_l) + K(x_j, x_k) \cdot K(x_i, x_l) \tag{1}$$

This transformation guarantees that, if the input func- 
tion K is a valid kernel, so is the resulting pairwise 
function. 
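
The pairwise transformation can be sketched in a few lines of Python. The linear base kernel and the three toy feature vectors are assumptions made for illustration only; any valid object-level kernel could be plugged in instead.

```python
def linear_kernel(x, z):
    """Object-level kernel (illustrative choice): plain dot product."""
    return sum(a * b for a, b in zip(x, z))

def pairwise_kernel(kernel, pair_a, pair_b):
    """Similarity between two object pairs, built from an object-level kernel."""
    (x_i, x_j), (x_k, x_l) = pair_a, pair_b
    return kernel(x_i, x_k) * kernel(x_j, x_l) + kernel(x_j, x_k) * kernel(x_i, x_l)

# Hypothetical feature vectors for three objects.
a, b, c = (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)

same_order = pairwise_kernel(linear_kernel, (a, b), (a, c))
swapped = pairwise_kernel(linear_kernel, (b, a), (a, c))
```

Summing the two cross products makes the value independent of the order of the objects within each pair, which matches the symmetric nature of binding.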

As already explained, in SBR each target predicate is 
implemented as a kernel machine, and the state of a 
predicate for an uncharacterized pair of proteins can be 
inferred by querying the machine. Positive predictions 
correspond to true predicates, i.e. bound protein pairs, 
and negative predictions to false ones. The confidence 
of the kernel machine, also called margin, embodies the 
confidence in the state of the predicate, that is, how 
strongly two proteins are believed to interact (or not). 

Given the output of the kernel machines for all target 
predicates, SBR uses the First Order Logic rules to con- 
dition the state of the correlated predicates. It does so by 
first translating the FOL rules into continuous constraints, 
which we discuss more thoroughly in Methods. The variables
that appear in the continuous constraints 
are the confidences of all target predicates (and the state 
of all given predicates) appearing in the equivalent FOL 
constraint. The amount of violation is reflected by the 
value of the continuous constraints: if the predicted pred- 
icates satisfy a FOL rule, the corresponding constraint will 
have a value equal to 1; on the other hand, the closer the 
constraint value to zero, the more the FOL rule is violated. 

SBR computes a solution to the inference problem, i.e. 
deciding the truth value of all target predicates, that maxi- 
mizes both the confidence of individual predicates and the 
amount of satisfaction of all constraints. Informally, the 
optimal assignment to all predicates, i.e. the binding state 



of protein, domain and residue pairs, y*, is a solution to 
the following optimization problem: 

$$y^* = \arg\max_{y} \; \mathrm{consist}(y, f) + \mathrm{consist}(y, KB)$$

where the first term accounts for consistency between 
inferred truth values and confidence of the individual pre- 
dictions, and the second incorporates information on the 
degree of satisfaction of the constraints built from the 
FOL knowledge. Contrary to standard kernel methods, 
this optimization problem is non-convex. This is com- 
monly the case for complex statistical-relational learning 
tasks [24], and implies that we are restricted to finding 
local optima. SBR employs a two-stage learning process to 
make training effective even in the presence of local optima. 
In particular, the first stage of SBR learning takes into 
account the fitting of the individual predictions to the 
supervised data. This learning task is convex and can be 
efficiently solved. The solution found in the first stage is 
used as starting point for a second stage, where the FOL 
knowledge is also considered. This optimization strategy 
has been shown experimentally to find high-quality solutions
without adding the computational burden of other 
non-convex optimization techniques [22]. 

SBR is a semi-supervised method [28], meaning that 
the set of target proteins is given beforehand and can be 
exploited during the learning stage to fine-tune the model. 
Semi-supervised learning is known to enhance the pre- 
diction ability when appropriately used [29], and can be 
applied very naturally to PPI prediction, as the full set of 
proteins is always known. 

To summarize, at each level the state of an uncharacterized
pair of objects, e.g. proteins p and p', is mainly 
inferred by the similarity of the pair (p, p') to other pairs 
that are known to interact or not, through the pairwise
kernel function K and the learned weights w. Thus 
the kernel propagates information horizontally 
within the same level. At the same time, the FOL constraints
propagate information vertically between 
the levels, by keeping the interaction pattern along the 
protein-domain-residue hierarchy consistent. 

Modeling multi-level interactions 

As already explained, we use two distinct kinds of predi- 
cates: given predicates and target predicates. Given pred- 
icates encode a priori knowledge about the problem, in 
our case the structure of the multi-level object hierar- 
chy. In particular, given a protein p and a domain d, the 
parentpd(p, d) predicate is true if and only if domain d 
occurs in protein p; the parentdr predicate is the analogue
for domains and residues. This simple representation 
suffices to encode the whole protein-domain-residue 
hierarchy. To simplify the notation, we also introduce the 
hasdom(p) predicate to encode the fact that protein p 
has at least one domain. More formally: 

$$\mathtt{hasdom}(p) := \exists d\;\; \mathtt{parentpd}(p, d)$$

The hasdom predicate can be computed directly by SBR 
using the above definition; we instead pre-compute its 
value for all protein pairs for run-time efficiency. 
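
A minimal sketch of this pre-computation, assuming that the parentpd predicate is stored as a set of (protein, domain) tuples. The storage format and the identifiers are hypothetical and are not taken from the SBR implementation.

```python
def compute_hasdom(proteins, parentpd):
    """Map each protein to True iff some (protein, domain) pair is in parentpd."""
    with_domains = {p for p, _ in parentpd}
    return {p: p in with_domains for p in proteins}

# Hypothetical identifiers: P3 has no known domain.
proteins = ["P1", "P2", "P3"]
parentpd = {("P1", "D1"), ("P1", "D2"), ("P2", "D3")}

hasdom = compute_hasdom(proteins, parentpd)
```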

The boundp(p, p') target predicate models the binding
state of two distinct proteins. Its state is known for 
certain protein pairs, i.e. those in the training set, and 
our goal is to predict its state on the remaining ones. 
The boundd(d, d') predicate plays the same role for 
domains. For a complete list of predicates, see Table 1. For 
a visualization of the predicates instantiated over a protein 
pair, see Figure 2(b). 

In what follows we describe how to design inter-level 
FOL constraints to properly enforce consistency between 
predictions at different levels. We focus on modeling the 
constraints tying proteins and domains; it is easy to see 
that the ones between domains and residues can be mod- 
elled similarly (with one peculiar exception that will be 
pointed out later). Table 2 reports the complete list of 
rules. 

Inter-level constraints can be seen as propagating infor- 
mation from the upper layer to the lower one and in the 
opposite direction. To model this mechanism, we use two 
distinct constraints: the P→D rule and the D→P rule. A 
simplified version of the P→D rule is: 

$$\forall (p, p')\;\; \mathtt{boundp}(p, p') \Rightarrow \exists (d, d')\;\; \mathtt{boundd}(d, d') \wedge \mathtt{parentpd}(p, d) \wedge \mathtt{parentpd}(p', d')$$

Intuitively, the rule means that whenever two proteins 
are bound (and therefore the left-hand side (LHS) of the 
implication is true) then there must be at least one pair of 
child domains that are bound (the right-hand side (RHS) is 
true). In classical First Order Logic the rule would require 
that, whenever none of the child domains is bound (the 
RHS is false), then the parent proteins must not be bound 
(the LHS is false). 

Table 1 Predicates

Target predicates
boundp(p, p')      true iff the protein pair (p, p') is bound
boundd(d, d')      true iff the domain pair (d, d') is bound
boundr(r, r')      true iff the residue pair (r, r') is bound

Given predicates
parentpd(p, d)     true iff protein p is parent of domain d
parentdr(d, r)     true iff domain d is parent of residue r
parentpr(p, r)     true iff protein p is parent of residue r
hasdom(p)          true iff protein p has at least one domain
hasres(d)          true iff domain d has at least one residue

List of predicates used by SBR. 

Note that, in the above formulation, the rule is applied 
indiscriminately to all protein pairs, even to those that 
have no known child domains in the considered dataset. 
Therefore, the rule can be reformulated in order to enforce 
it only for those protein pairs that do in fact have child 
domains, using the hasdom predicate, as follows: 

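$$\forall (p, p')\;\; \mathtt{hasdom}(p) \wedge \mathtt{hasdom}(p') \Rightarrow \big( \mathtt{boundp}(p, p') \Rightarrow \exists (d, d')\;\; \mathtt{boundd}(d, d') \wedge \mathtt{parentpd}(p, d) \wedge \mathtt{parentpd}(p', d') \big)$$

This is the complete P→D rule. The left-hand side is 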
always false for proteins without domains, making the 
rule always satisfied in this case (effectively disabling the 
effect of the rule on the learning process). We define the 
complementary D→P rule as follows: 

$$\forall (p, p')\;\; \big( \exists (d, d')\;\; \mathtt{boundd}(d, d') \wedge \mathtt{parentpd}(p, d) \wedge \mathtt{parentpd}(p', d') \big) \Rightarrow \mathtt{boundp}(p, p')$$

This rule is applied to all protein pairs, demanding that 
if there is a pair of bound children domains then the pro- 
teins must be bound too, and vice versa that if the parent 
proteins are unbound so are the domains. The P→D and 
D→P rules could be merged into a single equivalent rule 
using the double implication (⇔). However, the rules have 
been considered separately to keep their effects on the 
results separated and easier to analyze. 
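
To make the classical (Boolean) reading of the two rules concrete, the sketch below checks them on a crisp assignment of the predicates. It assumes that predicates are stored as sets of ordered tuples and that pairs are looked up in the given order only; SBR itself works with continuous relaxations of these rules, so this illustrates the logic rather than the actual learning procedure. All identifiers are hypothetical.

```python
from itertools import product

def child_domains(protein, parentpd):
    """Domains d such that parentpd(protein, d) is true."""
    return sorted(d for (p, d) in parentpd if p == protein)

def some_child_pair_bound(p, q, parentpd, boundd):
    """Existential part shared by both rules: a bound pair of child domains."""
    pairs = product(child_domains(p, parentpd), child_domains(q, parentpd))
    return any(pair in boundd for pair in pairs)

def p_to_d_holds(p, q, parentpd, boundp, boundd):
    """Complete P->D rule, guarded by hasdom(p) and hasdom(q)."""
    if not (child_domains(p, parentpd) and child_domains(q, parentpd)):
        return True  # guard is false, so the rule is trivially satisfied
    return (p, q) not in boundp or some_child_pair_bound(p, q, parentpd, boundd)

def d_to_p_holds(p, q, parentpd, boundp, boundd):
    """D->P rule: a bound pair of child domains implies bound parents."""
    return not some_child_pair_bound(p, q, parentpd, boundd) or (p, q) in boundp

# Hypothetical toy hierarchy: P3 has no known domains.
parentpd = {("P1", "D1"), ("P2", "D2")}
boundp = {("P1", "P2"), ("P1", "P3")}
```

With no bound domain pairs, P→D is violated for (P1, P2) but trivially satisfied for (P1, P3), because P3 has no domains in the toy data.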

To simulate the unidirectional information propagation
between levels, as done by Yip et al. [21] (see 
Related work), we modified how SBR converts logic impli- 
cations by using the t-norm residuum, which states that 
a logic implication is true if the RHS is at least as true 
as the LHS. This modification also removes a bias in the 
translation of the implication that was affecting the origi- 
nal formulation of SBR, whose effect is to often move the 
LHS toward the false value. See Methods for details. 
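
This section does not state which t-norm is used, so the following sketch simply lists the residua of three standard t-norms as an illustration of the property mentioned above: each returns 1 exactly when the RHS is at least as true as the LHS. Truth values are assumed to lie in [0, 1], and the function names are hypothetical.

```python
def lukasiewicz_residuum(lhs, rhs):
    """Residuum of the Lukasiewicz t-norm."""
    return min(1.0, 1.0 - lhs + rhs)

def goedel_residuum(lhs, rhs):
    """Residuum of the minimum (Goedel) t-norm."""
    return 1.0 if lhs <= rhs else rhs

def product_residuum(lhs, rhs):
    """Residuum of the product t-norm."""
    return 1.0 if lhs <= rhs else rhs / lhs

RESIDUA = (lukasiewicz_residuum, goedel_residuum, product_residuum)

# Implication fully satisfied: RHS (0.8) is at least as true as LHS (0.3).
satisfied = [r(0.3, 0.8) for r in RESIDUA]
# Implication partially violated: RHS (0.2) is less true than LHS (0.8).
violated = [r(0.8, 0.2) for r in RESIDUA]
```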

The constraints for domains and residues can be similarly
defined with one important exception. The P→D rule 
described above (correctly) requires at least one domain 
couple to be bound for each interacting protein pair. 
However, when two domains are bound, the interaction 
interface involves more than one residue pair: for instance, 
binding sites collected in the protein-protein docking 
benchmark version 3.0 [30] consist of 25 residues on average
[31]. We integrate this observation in the D→R rule 
using the n-existential operator $\exists_n$ in place of the regular
existential (see Table 2 for the complete formulation), 
so that whenever two domains are bound, at least n pairs 
of their residues must be bound. Since interfaces in the 
employed dataset are typically 5 residues long, n = 5 has 
been used in the experiments. Our results demonstrate 
that this seemingly small modification has a rather extensive
impact on the prediction of domain and residue level 
interactions. 

Table 2 Rules

Name   Definition
P→D    $\forall (p, p')\;\; \mathtt{hasdom}(p) \wedge \mathtt{hasdom}(p') \Rightarrow \big( \mathtt{boundp}(p, p') \Rightarrow \exists (d, d')\;\; \mathtt{boundd}(d, d') \wedge \mathtt{parentpd}(p, d) \wedge \mathtt{parentpd}(p', d') \big)$
D→P    $\forall (p, p')\;\; \big( \exists (d, d')\;\; \mathtt{boundd}(d, d') \wedge \mathtt{parentpd}(p, d) \wedge \mathtt{parentpd}(p', d') \big) \Rightarrow \mathtt{boundp}(p, p')$
D→R    $\forall (d, d')\;\; \mathtt{hasres}(d) \wedge \mathtt{hasres}(d') \Rightarrow \big( \mathtt{boundd}(d, d') \Rightarrow \exists_n (r, r')\;\; \mathtt{boundr}(r, r') \wedge \mathtt{parentdr}(d, r) \wedge \mathtt{parentdr}(d', r') \big)$
R→D    $\forall (d, d')\;\; \big( \exists (r, r')\;\; \mathtt{boundr}(r, r') \wedge \mathtt{parentdr}(d, r) \wedge \mathtt{parentdr}(d', r') \big) \Rightarrow \mathtt{boundd}(d, d')$
P→R    Same as D→R, with proteins in place of domains
R→P    Same as R→D, with proteins in place of domains

List of FOL constraints used by SBR. 
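
The crisp meaning of the n-existential condition can be sketched as a simple count. The snippet assumes that parentdr and boundr are stored as sets of ordered tuples and uses hypothetical identifiers; how SBR relaxes this operator into a continuous constraint is described in Methods and is not reproduced here.

```python
def n_exists_bound(d, e, parentdr, boundr, n=5):
    """True iff at least n residue pairs (r, s), with r in d and s in e, are bound."""
    res_d = [r for (dom, r) in parentdr if dom == d]
    res_e = [r for (dom, r) in parentdr if dom == e]
    n_bound = sum((r, s) in boundr for r in res_d for s in res_e)
    return n_bound >= n

# Hypothetical toy data: two domains with six residues each, five bound pairs.
parentdr = {("D1", f"r{i}") for i in range(6)} | {("D2", f"s{i}") for i in range(6)}
boundr = {(f"r{i}", f"s{i}") for i in range(5)}

enough_for_5 = n_exists_bound("D1", "D2", parentdr, boundr, n=5)
enough_for_6 = n_exists_bound("D1", "D2", parentdr, boundr, n=6)
```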

Related work 

In this section we briefly summarize previous PPI
prediction approaches using methods that are most 
closely related to the present paper: kernel methods, 
semi-supervised methods, and logic-based methods. For 
a broader exposition of interaction prediction methods, 
please refer to one of the several surveys on the subject 
[4,6,9,32]. 

The earliest attempt to employ kernel methods [26] for 
PPI prediction is the work of Bock et al. [33], which casts 
interaction prediction as pairwise classification, using 
amino-acid composition and physico-chemical properties 
alone. Ben-Hur et al. [27] extended the previous work by 
applying pairwise kernels and combining multiple data 
sources (primary sequence, Pfam domains, Gene Ontol- 
ogy annotations and interactions between orthologues). 
Successive publications focused primarily on aggregat- 
ing more diverse sources, including phylogenetic pro- 
files, genetic interactions, and subcellular localization and 
function [6]. Kernel machines have also been applied to 
the prediction of binding sites from sequence, as summa- 
rized in [10]. The appeal of supervised kernel methods is 
that they provide a proven and theoretically grounded set 
of techniques that can easily integrate various information 
sources, and can naturally handle noise in the data. How- 
ever, they have two inherent limitations: (i) the binding 
state of two proteins is inferred independently from the 
state of all other proteins, and (ii) due to their supervised 
nature, they do not take advantage of unsupervised data, 
which is very abundant in the biological network setting. 



Semi-supervised learning (SSL) techniques [28,29] 
attempt to solve these issues. In the SSL setting the set 
of target proteins is known in advance, meaning that the 
learning algorithm has access to their distribution in fea- 
ture space. This way the inference task can be simplified 
by introducing unsupervised constraints that assign the 
same label to proteins that are, e.g., close enough in feature 
space, or linked in the interaction network, instantiat- 
ing a form of information propagation. There are several 
works in the PPI literature that embed the known network
topology using SSL constraints. Qi et al. [34] apply 
SSL methods to the special case of viral-host protein interactions,
where supervised examples are extremely scarce. 
Using similar methods, You et al. [35] attempt to detect 
spurious interactions in a known network by projecting 
it on a low-dimensional manifold. Other studies [36,37] 
applied SSL techniques to the closely related problems 
of gene-protein and drug-protein interaction prediction. 
Despite the ability of SSL to integrate topology informa- 
tion, no study so far has applied it to highly relational 
problems such as the MLPIP. 

An alternative strategy for interaction prediction is 
Inductive Logic Programming (ILP) [38], a group of logic- 
based formalisms that extract rules explaining the likely 
underlying causes of interactions. ILP methods were stud- 
ied in the work of Tran et al. [39] using a large number of 
features: SWISS-PROT keywords and enzyme properties, 
Gene Ontology functional annotations, gene expression, 
cell cycle and subcellular localization. Further advances 
in this direction, with a special focus on using domain 
information, can be found in [17,18]. The advantage of 
ILP methods over purely statistical methods is that they 
are inherently able to deal with relational information, 
making them ideal candidates for solving the MLPIP 
problem. Alas, contrary to kernel methods, they tend to 
be very susceptible to noise, which is a very prominent 
feature of interaction datasets, and are less effective in 
exploiting complex feature representations, e.g. involving
highly non-linear interactions between continuous 
features. 

Recently, some works highlighted the importance of 
the multi-level nature of protein-protein interactions. 
Gonzalez et al. [40] propose a method to infer the residue 
contact matrix from a known set of protein interactions 
using SVMs; in contrast, our goal is to predict the 
interactions concurrently at all levels of the hierarchy. 
Another study [13] highlights the relevance of domain- 
level interactions, and the unfortunate lack of details 
thereof, and formulates a method to reinterpret a known 
PPI network in terms of its constituent domain interac- 
tions; the present work has a different focus and a more 
general scope. 

Most relevant to this paper is the work of Yip et al. 
[21], where the authors propose a procedure to solve 
the MLPIP problem based on a mixture of different 
techniques. The idea is to decompose the problem as a 
sequence of three prediction tasks, which are solved iter- 
atively. Given an arbitrary order of the three levels (e.g. 
proteins first, then domains, then residues), their proce- 
dure involves computing putative interactions in the first 
level (in this case proteins), then using the most confident 
predictions as novel training examples at the following 
level (i.e., domains). The procedure is repeated until a 
termination criterion is met. 

Intra-level predictions are obtained with Support Vec- 
tor Regression (SVR) [41]. In particular, each object has 
an associated SVR machine that models its propensity to 
bind any other object in the same level. The extrapolated 
values act as confidences for the predictions themselves. 
The mechanism for translating the most confident pre- 
dictions at one level into training examples for the next 
level depends on the relative position of the two levels in 
the hierarchy. Downward propagation (e.g. from proteins 
to domains) simply associates to each novel example the 
same confidence as the parent prediction: in other words, 
if two proteins are predicted as bound with high confi- 
dence, all their domains will be considered bound with the 
same confidence. Upward propagation (e.g. from domains 
to proteins) is a bit more involved: the confidence assigned 
to the novel example (protein) is a noisy-OR combination 
of confidences for all the involved child objects (domains). 
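
The two propagation rules described above can be illustrated with a minimal Python sketch. The function names and the example confidences are hypothetical and are not taken from the implementation of Yip et al.; the noisy-OR is written in its standard form, 1 - prod(1 - c_i), assuming that confidences lie in [0, 1].

```python
from math import prod


def propagate_down(parent_confidence, n_child_pairs):
    """Downward rule: every child pair inherits the parent's confidence."""
    return [parent_confidence] * n_child_pairs


def propagate_up(child_confidences):
    """Upward rule: noisy-OR of the child confidences, 1 - prod(1 - c)."""
    return 1.0 - prod(1.0 - c for c in child_confidences)


# Hypothetical example: a protein pair whose two domain pairs each have
# confidence 0.5, and a protein pair with confidence 0.9 and three domain pairs.
print(propagate_up([0.5, 0.5]))
print(propagate_down(0.9, 3))
```

The sketch makes the asymmetry visible: downward propagation copies a single value to every child, whereas upward propagation grows with the number of confident children.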

While this method has been shown to work reasonably
well, it has several flaws. First, while the iterative
procedure is grounded in co-training [42], the specific
choice of components is not as theoretically sound. For
instance, the authors apply regression techniques to a
classification task, which may lead to sub-optimal results.
The inter-level example propagation mechanisms are ad hoc,
do not exploit all the information at each level (only the
most confident predictions are propagated), and are designed
merely to propagate information between levels, not to
enforce consistency on the predictions. In particular, the
downward propagation rule is rather arbitrary: it is not
clear why all domains of bound proteins should themselves
be bound with the same confidence. Finally, these rules,
which are intimately tied to the specific implementation,
are not defined using a formal language, and are therefore
difficult to extend. For instance, it would be difficult to
implement in said framework something similar to an
n-existential propagation rule, which is extremely useful
for dealing with residue interactions.

Semantic Based Regularization offers several advantages
in this context. A first advantage is that it decouples
the implementation of the functions from how consistency
among levels is defined. Indeed, consistency is implemented
via a set of constraints, which are applied over the output
of the predictors. There is, however, no limitation on which
kind of predictors are used. For example, we used kernel
machines as the basic machinery for implementing the
predictors, so that different state-of-the-art kernels can
be used at the individual levels while still defining a
single optimization problem.

Furthermore, SBR natively propagates the predictions of
one level to the other levels. Since the predictions and
not the supervisions are propagated, SBR can take advantage
of the abundant unsupervised data. The availability of an
efficient implementation of the n-existential quantifier is
also a crucial advantage: if two proteins or domains are
interacting, a small set of residues must be interacting as
well. SBR does not simply propagate a generic prior to all
the residues of a protein or domain, which could decrease
the accuracy of the predictions for the negative
supervisions. Instead, SBR performs a search to select a
subset of candidate residues on which to enforce the
interaction. As shown in the experimental results, this
greatly improves residue prediction accuracy. Finally, the
circular dependencies that make learning difficult are dealt
with in the context of a general and well-defined framework,
which implements various heuristics to make training
effective.

Results and discussion 

Dataset 

In this work we use the dataset of Yip et al [21], described 
here for completeness. The dataset represents proteins, 
domains and residues using features gathered from a 
variety of different sources: 

• Protein features include phylogenetic profiles derived 
from COG, subcellular localization, cell cycle and 
environmental response gene expression; 
protein-pair features were extracted from Y2H and 
TAP-MS data. The gold standard of positive 
interactions was constructed by aggregating
experimentally verified or structurally determined
interactions taken from MIPS, DIP, and iPfam.

• At the domain level, the dataset includes both 
features for domain families and for domain 
instances based on frequencies of domains within 
one or more species and phylogenetic correlations of 
Pfam alignments. The gold standard of positive 
interactions was built from 3D structures of 
complexed proteins taken from PDB. 

• Residue features consist of sequence-based 
properties, namely charge complementarity, Psi-Blast 
[43] profiles, predicted secondary structure, and 
predicted solvent accessibility. 

Kernels computed from the individual features were com- 
bined additively into a single kernel function for each 
level, and then transformed into pairwise kernels using 
Equation (1); the resulting functions were used as inputs 
to SBR. A visualization of the process can be found in 
Figure 2. 

This procedure yields a dataset of 1681 proteins, 2389 
domains, and 3035 residues, with a gold standard of 3201 
positive (interacting) protein pairs, 422 domain pairs, and 
2000 residue pairs. Since interaction experiments cannot
determine which pairs do not interact, the gold standard
of negative pairs is built by randomly sampling, at each
level, a number of pairs that are not known to interact
(i.e., not positive). This is a common approach to negative
labeling in the PPI prediction literature [44]. To keep the
dataset balanced, the number of sampled negative pairs
is identical to the number of pairs in the gold standard
of positives. For more details on the dataset preparation,
please refer to [21]. We further refined the dataset
by running CD-HIT [45] with a 20% sequence similarity
threshold, identifying 23 redundant proteins. These
proteins were not used when comparing the performance
of the methods.
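
The balanced negative sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the seed and the toy object names are hypothetical, and enumerating all candidate pairs is practical only for small examples (for datasets of realistic size one would more likely draw random pairs and reject the positives).

```python
import random
from itertools import combinations


def sample_negatives(objects, positives, seed=0):
    """Sample as many non-positive pairs as there are positive pairs."""
    positive_set = {frozenset(pair) for pair in positives}
    # Every unordered pair that is not a known positive is a candidate negative.
    candidates = [pair for pair in combinations(objects, 2)
                  if frozenset(pair) not in positive_set]
    rng = random.Random(seed)
    return rng.sample(candidates, len(positive_set))


# Hypothetical toy level with five objects and two known interactions.
objects = ["A", "B", "C", "D", "E"]
positives = [("A", "B"), ("C", "D")]
negatives = sample_negatives(objects, positives)
print(negatives)
```

Note that pairs sampled in this way are merely "not known to interact"; some of them may be true but undiscovered interactions.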

Turning our attention to the resulting dataset, we note
that most of the supervision is located at the protein
level: out of all possible interactions between pairs of
proteins, which are (1681 × 1680)/2, 0.226% are known
(either positive or negative). In contrast, the other
levels hold much less information: only 0.042% of all
possible residue pairs, and 0.014% of all possible domain
pairs, are in the dataset. The low number of residue pairs
is due to (i) different requirements for experimentally
determining the interactions at the three levels, i.e.
whether the structure is available; and (ii) sampling
choices made by Yip et al. [21].
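
To give a sense of the scale involved, the following sketch counts the unordered pairs at each level, using the object counts and gold-standard positive counts quoted above. It reports only the share of positives among all possible pairs, so it is not expected to reproduce the quoted percentages exactly, which may also depend on preprocessing steps not modeled here. The variable names are illustrative.

```python
def n_pairs(n):
    """Number of unordered pairs of distinct objects: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Object and positive-pair counts as quoted in the dataset description.
counts = {"proteins": 1681, "domains": 2389, "residues": 3035}
positives = {"proteins": 3201, "domains": 422, "residues": 2000}

for level, n in counts.items():
    total = n_pairs(n)
    share = 100.0 * positives[level] / total
    print(f"{level}: {total} possible pairs, positives = {share:.3f}%")
```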

Evaluation procedure 

In this work we compare our method to that of Yip
et al. [21], where the authors evaluated their method
using a 10-fold cross-validation procedure. To keep
the comparison completely fair, we repeated said
procedure with SBR, reusing the very same train/test
splits. Since correlated objects, e.g. a protein and its
domains/residues, share information, the folds were
structured so as to prevent such information from leaking
between train and test folds: this was achieved by keeping
correlated objects in the same fold. In order not to bias
the performance estimates, all redundant proteins were
ignored, along with their domains and residues, when
computing the results of both SBR and the method of
Yip et al. The full experimental setup and instructions
to replicate the experiments can be downloaded at
http://sites.google.com/site/semanticbasedregularization/home/software/protein_interaction.
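
The idea of keeping correlated objects in the same fold can be sketched as a grouped split, in which whole groups (here, a protein together with its domains and residues) are assigned to folds. This is a simplified illustration with hypothetical names; the experiments reported here reuse the splits of Yip et al., which are not reproduced by this code, and splitting pairs of objects requires additional care.

```python
from collections import defaultdict


def grouped_folds(object_to_group, k=10):
    """Assign whole groups to folds so correlated objects are never split."""
    groups = sorted(set(object_to_group.values()))
    group_fold = {group: i % k for i, group in enumerate(groups)}
    folds = defaultdict(list)
    for obj, group in object_to_group.items():
        folds[group_fold[group]].append(obj)
    return dict(folds)


# Hypothetical objects: each domain and residue is grouped with its protein.
object_to_group = {
    "P1": "P1", "P1.d1": "P1", "P1.d1.r7": "P1",
    "P2": "P2", "P2.d1": "P2",
}
print(grouped_folds(object_to_group, k=2))
```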

SBR has two scalar hyper-parameters that control the
contribution of various parts of the objective function:
λ_c, the weight associated with the constraints (how much
the current solution is consistent with respect to the
rules), and λ_r, which controls the model complexity (see
the Methods section for more details). The λ_r parameter
was optimized on the first fold by training the model
without the logic rules, and it was then kept fixed for all
the folds of the k-fold cross-validation. The resulting
value is λ_r = 0.1. The λ_c parameter was not optimized and
was kept fixed at λ_c = 1. Please note that further gains
for the proposed method could be achieved by fine-tuning
this meta-parameter. However, since the dataset from Yip
et al. does not include a validation split, there was no
sound way to optimize this parameter without looking at the
test set or redefining the splits (which would make it
difficult to compare against the results of Yip et al.).
Therefore, we decided not to perform any tuning for this
meta-parameter.

We computed three performance metrics: the Receiver
Operating Characteristic (ROC) curve, the area under the
ROC curve (AUCROC, or AUC for short), and the F1 score. The
ROC curve represents the relation between the false
positive rate (FPR) and the true positive rate (TPR), and
can be seen as the proportion of true positives gained by
"paying" a given proportion of false positives. By
definition, the ROC curve is monotonically non-decreasing;
the steeper the curve, the better the predictions. The AUC
measures the ability to correctly discriminate between
positives and negatives, or alternatively, the ability to
rank positives above negatives. It is independent of any
classification threshold, and thus particularly fit to
evaluate models over the whole spectrum of possible decision
thresholds. The F1 score is the harmonic mean of precision
and recall. Contrary to the AUC, the F1 takes into account
the predicted label, but not its confidence (margin).
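
The two scalar metrics can be made concrete with a small self-contained sketch. It uses the ranking interpretation of the AUC (the fraction of positive-negative pairs in which the positive is scored higher, with ties counted as one half) and the usual definition of F1; the scores, the labels and the 0.5 threshold are made-up values for illustration only.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1(predicted, labels):
    """Harmonic mean of precision and recall for binary predictions."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up confidences and gold labels for four pairs.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # illustrative threshold
print(auc(scores, labels), f1(predicted, labels))
```

The AUC is computed from the scores alone, while the F1 depends on the thresholded labels, which mirrors the distinction drawn in the text.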

We computed the average AUC and F1 of our method
and of the competitor over all folds of the
cross-validation; the results can be found in Table 3 and
Table 4. The ROC curves were computed by collating the
results of all test folds, and can be found in Figure 3.
Since the ROC curves and F1 scores are not reported in [21],
and the dataset is slightly smaller because of the
redundancy elimination step we introduced, we had to compute
their results on a local re-run of their experiment. As a
result, the AUC values presented in Table 3 are slightly
different from those reported in [21]. However, we note that
the results of our analysis would still apply if we had
chosen to use the AUC values reported in [21].

Table 3 Results (AUC)

Level      Independent   Unidirectional            Bidirectional             Full
                         P→D     D→R     P→R       P↔D     D↔R     P↔R

Results for Yip et al. [21]
Proteins   0.723         -       -       -         0.722   -       0.725     0.724
Domains    0.531         0.619   -       -         0.688   0.695   -         0.673
Residues   0.563         -       0.542   0.549     -       0.576   0.659     0.722

Results for SBR
Proteins   0.808         -       -       -         0.820   -       0.819     0.820
Domains    0.605         0.814   -       -         0.837   0.896   -         0.937
Residues   0.591         -       0.664   0.671     -       0.675   0.673     0.676

Results for SBR-∃n
Proteins   0.808         -       -       -         0.820   -       0.819     0.821
Domains    0.605         0.814   -       -         0.837   0.895   -         0.956
Residues   0.591         -       0.745   0.760     -       0.778   0.772     0.797

Area under the ROC curve values attained by Yip et al. [21], SBR, and SBR-∃n (SBR equipped with the n-existential quantifier). A dash marks a setting for which no value is reported.

Results 

To evaluate the effects of the constraints on the
performance of SBR, we performed three independent
experiments using rules of increasing complexity. This
setup closely follows that of Yip et al. [21].

Independent levels 

As a baseline, we estimate the performance of our method 
when constraints are ignored. This is equivalent to the 
method of Yip et al. when no information flow between
levels is allowed. The results can be found in the "Inde- 
pendent" column of Tables 3 and 4. 

In the absence of constraints, SBR reduces to standard
L2-regularized SVM classification: learning and inference
become convex problems, and the method computes
the globally optimal solution. Thus, the only differences
between our method and the competitor are: (i) using
classification versus regression, and (ii) using pairwise
classification, instead of training a single model for each
entity (protein, domain, residue). These differences alone
produce a substantial increase in performance: the F1
changes by about +0.05 in all three cases. The AUC of
proteins and domains is improved by about +0.09 and
+0.07, respectively, while residues are less affected, with
a +0.03 difference.

Table 4 Results (F1)

Level      Independent   Unidirectional            Bidirectional             Full
                         P→D     D→R     P→R       P↔D     D↔R     P↔R

Results for Yip et al. [21]
Proteins   0.665         -       -       -         0.665   -       0.666     0.666
Domains    0.518         0.620   -       -         0.662   0.659   -         0.661
Residues   0.522         -       0.510   0.514     -       0.602   0.609     0.613

Results for SBR
Proteins   0.718         -       -       -         0.722   -       0.722     0.723
Domains    0.568         0.693   -       -         0.696   0.731   -         0.750
Residues   0.579         -       0.605   0.605     -       0.605   0.605     0.602

Results for SBR-∃n
Proteins   0.717         -       -       -         0.722   -       0.722     0.722
Domains    0.568         0.693   -       -         0.696   0.729   -         0.757
Residues   0.579         -       0.635   0.639     -       0.641   0.644     0.650

F1 values attained by Yip et al. [21], SBR, and SBR-∃n (SBR equipped with the n-existential quantifier). A dash marks a setting for which no value is reported.

Figure 3 ROC curves. ROC curves obtained with the 10-fold cross-validation procedure, for all experimental settings and all levels of the hierarchy.
(Left) Results for Yip et al. (Right) Results for SBR-∃n. (Top) ROC curves for protein-level predictions with different sets of constraints, from fully
independent to fully connected levels. (Middle) Domain-level predictions. (Bottom) Residue-level predictions. Each plot includes multiple ROC
curves, one for each experimental setting; see the legends for more details. The horizontal axis of each plot reports the False Positive Rate.

Unidirectional constraints 

In the second experiment, we evaluate the effect of
introducing unidirectional constraints between pairs of
levels. In the P→D case only the P→D rule is active,
meaning that bound protein pairs enforce positive domain
pairs and negative domain pairs enforce negative protein
pairs. The D→R and P→R cases are defined similarly. In all
three cases, the level not appearing in the rule (e.g. the
residue level in the P→D case) is predicted independently.
This setup makes it easy to study the effects of
propagating information from one level to the other without
interference. The results can be found in the
"Unidirectional" column of Tables 3 and 4. In the same
column we also show the results for Yip et al. for the
unidirectional flow setting, where examples are propagated
from one level to the next but not vice versa. However,
since the competitor's algorithm is iterative, information
about lower levels can indeed affect the upper levels in
successive iterations.

The results show that introducing unidirectional
constraints in SBR improves the predictions in all cases. In
particular, using (predicted and known) protein
interactions helps infer correct domain interactions, which
improve by about +0.13 F1 and +0.2 AUC (P→D case).
Residues improve independently of whether protein- or
domain-level information is used, with a +0.03 F1 in
both cases, and a +0.08/+0.07 AUC difference,
respectively. Interestingly, proteins tend to help residue
predictions slightly more than domains, despite the
indirection between the two levels; this is likely an effect
of the larger percentage of supervised pairs available.

Compared to SBR, the method of Yip et al. does not
benefit as much from unidirectional information flow.
Protein-level information improves domain predictions
only (+0.1 F1, +0.06 AUC for P→D), while
residue predictions are worse than in the independent
case (-0.05 and -0.04 AUC, and -0.01 F1, in the D→R
and P→R cases, respectively).

Bidirectional constraints 

In the third experiment we study the impact of using
bidirectional constraints between pairs of levels; the level
not appearing in the rules is predicted independently, as
above. In the P↔D case, both the P→D and D→P rules
are active, meaning that the protein and domain levels are
enforced to be fully consistent; the P↔R and D↔R cases
are defined analogously. This experiment is comparable to
the bidirectional flow setting of Yip et al. The results can
be found in the "Bidirectional" column of Tables 3 and 4.

We observe that the new constraints have a positive
effect on predictions at all three levels: proteins change
from 0.808 AUC to 0.820, domains from 0.814 to 0.896
and residues from 0.671 to 0.673. In terms of F1, the
changes are from 0.718 to (up to) 0.722 for proteins, from
0.693 to 0.731 for domains, and no change for residues.
The change is not as marked as between the independent
and unidirectional experiments. Domains see the largest
increase in performance (+0.08 AUC, +0.04 F1), mainly
thanks to the contribution of residue-level information,
which is more abundant. Proteins and residues are less
affected. The result is unsurprising for proteins, which
hold most of the supervision and are thus (i) more likely
to be predicted correctly in the independent setting, and
(ii) less likely to benefit from hints coming from the
other, less supervised levels.

As for the method of Yip et al., the bidirectional
flow mostly affects the domain and residue levels,
whose improvement is +0.07 AUC/+0.04 F1 and +0.11
AUC/+0.09 F1, respectively; the change for protein
interactions is negligible. Regardless of the relative
performance increase, SBR largely outperforms
the competitor in all configurations except one (F1 of the
P↔R case for residues).

We note that the fact that all three cases (P↔D, P↔R
and D↔R) improve over both the independent and the
unidirectional experiments shows not only that the
bidirectional constraints are in fact sound, but also that,
despite the increased computational complexity, SBR is
still able to exploit them appropriately.

All constraints 

In the final experiment we activate the P→D, D→P, D→R
and R→D rules, as defined in Table 2, making all levels
interact. This is the most complex setting, and produces
fully consistent predictions through the hierarchy. It is
comparable to the "PDR" bidirectional setting of Yip et al.
The AUC and F1 scores can be found in the "Full" column of
Tables 3 and 4, respectively.

In this experiment the P→R and R→P constraints are
not used. Direct information flow between proteins and 
residues is not needed, because it would be redundant: 
from a formal logic point of view, this corresponds to 
the observation that the logic rule expressing protein to 
residue consistency is implied by the other consistency 
rules. Indeed, we have experimentally verified that adding 
this propagation flow does not significantly affect the 
results. 
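
The redundancy argument can be checked mechanically in a simplified propositional setting, where P, D and R stand for "the proteins are bound", "some domain pair is bound" and "some residue pair is bound". The sketch below enumerates all truth assignments and verifies that (P ⇒ D) and (D ⇒ R) together entail (P ⇒ R). It is only an abstraction of the quantified rules used in the paper, and the helper name is illustrative.

```python
from itertools import product


def implies(a, b):
    """Material implication: a => b is equivalent to (not a) or b."""
    return (not a) or b


# For every truth assignment, if P => D and D => R hold, P => R must hold too.
entailed = all(
    implies(implies(p, d) and implies(d, r), implies(p, r))
    for p, d, r in product([False, True], repeat=3)
)
print(entailed)
```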

In this experiment, protein predictions are stable with 
respect to the previous experiments, confirming the intu- 
ition that the abundance of supervision at this level makes 
it less likely to benefit from predictions at the other 
ones. On the contrary, domains see a large performance 
upgrade, from 0.896 to 0.937 AUC and from 0.731 to 0.750
F1, when made to interact with both proteins and residues.
The change for residues is instead only marginal. 

The results for Yip et al. are mixed, with proteins faring
almost identically to the previous experiment, domains
showing a slight drop in AUC but an equally slight
increase in F1, and residues improving in AUC (+0.08)
but not in F1 (unchanged) over the bidirectional P↔R
case. The improvement in residue prediction (in terms
of AUC) stands in contrast with the results of SBR, and
is the only case in which the method of Yip and
colleagues works better than SBR. The issue lies within our
formulation of the D→R rule: whenever two domains are
bound, the rule is satisfied when at least one residue pair
is bound. As already mentioned above, this is not realistic:
protein interfaces span more than two residues, typically
five or more. We therefore extended SBR to support the
n-existential quantifier, which allows the D→R rule to be
reformulated to take this observation into account (see
the Methods section for more details on the n-existential
quantifier). The new D→R rule, shown in Table 2, requires
at least n = 5 residues to be bound for each pair of bound
domains. We chose the constant n = 5 to be both
realistic and, since the computational cost increases with
n, small enough to be easily computable. We applied the
same modification to the P→R rule.
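
The difference between the plain existential rule and its n-existential variant can be illustrated with a crisp Boolean sketch. This is an assumption-laden simplification: SBR operates on continuous relaxations of the rules, described in the Methods section, whereas the code below only checks hard truth values; the function names are hypothetical, and bound residue pairs are counted as the unit of interest.

```python
def n_exists(truth_values, n=5):
    """True iff at least n of the Boolean values are true."""
    return sum(bool(v) for v in truth_values) >= n


def domain_to_residue_rule(domains_bound, residue_pairs_bound, n=5):
    """D -> R with the n-existential: bound domains need >= n bound residue pairs."""
    return (not domains_bound) or n_exists(residue_pairs_bound, n)


# Hypothetical predictions for the residue pairs of one bound domain pair.
four_bound = [True] * 4 + [False] * 3
print(domain_to_residue_rule(True, four_bound, n=1))  # plain existential
print(domain_to_residue_rule(True, four_bound, n=5))  # n-existential, n = 5
```

With n = 1 the rule reduces to the ordinary existential quantifier, which a single bound residue pair already satisfies.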

The complete results for the resulting method, termed
SBR-∃n, can be found at the bottom of Tables 3 and
4. When comparing to standard SBR, i.e., without the
n-existential quantifier, we see that the performance of
residues consistently improves in all cases (unidirectional,
bidirectional, and with all constraints activated), allowing
SBR-∃n to always outperform the method of Yip et al. by a
significant margin, also on the residue interactions. As a
side effect of the better residue predictions, thanks to the
D→R and R→D rules domains also improve in the
all-constraints experiment. In particular, in the "Full"
experiment the improvement of SBR-∃n over Yip et al.
is +0.1/+0.26/+0.07 AUC and +0.06/+0.1/+0.04 F1 for
proteins, domains and residues respectively. We show in
Figure 4 an example prediction obtained by SBR-∃n for
the VPS25 and VPS36 ESCRT-II complex subunits. The
figure shows that, while the unconstrained (baseline)
predictions are inconsistent, the addition of the
constraints effectively makes the protein- and domain-level
predictions correct and consistent, and enables SBR to
improve the residue-level predictions.

Summing up, these results highlight the ability of SBR 
to enforce constraints even with highly complex combi- 
nations of rules, allowing the modeler to fully exploit the 
flexibility and performance improvement offered by non- 
standard FOL extensions like the n-existential operator. 

Discussion 

The results presented in the previous section offer a clear
perspective on the advantages of the proposed method.
By employing appropriate classification techniques and
training a single global pairwise model per level, rather
than relying on the less than optimal local (per-object)
regression models of Yip et al., a considerable
improvement was achieved even in the unconstrained
experiment. Furthermore, when enforcing consistency among
the protein, domain and residue levels and using the
n-existential quantifier, the experimental results are
significantly better than both the unconstrained baseline
and the corresponding results of Yip and colleagues, at all
levels and in all experimental settings.

It is worth noting that SBR performance improves 
monotonically with the increase of constraint complexity 
in the reported experiments. This result is far from obvi- 
ous, and confirms both that the biologically-motivated 
knowledge base is useful, and that SBR is able to
effectively apply it. In contrast, the competitor's method
does not always improve in a similar manner.

In general, the performance gain brought forth by 
inter-level propagation is not homogeneously distributed 
between the three levels. We register a large improvement 
for domains and residues, especially when SBR is used in 
conjunction with the n-existential quantifier. Proteins are 
less affected by consistency enforcement, most likely due 
to the availability of more supervised examples. 

We note that the FOL rules have a twofold effect. Firstly, 
they propagate information between the levels, enabling 
predicted interactions at one level to help infer correct
interactions at the other two levels. This is especially
clear in the "Full" experiment with the n-existential quan- 
tifier: in this case, better residue level predictions increase 
the overall quality of domain predictions as well. Secondly, 
the rules also guarantee that the predictions are consistent 
along the object hierarchy. 

Summarizing, SBR largely outperforms the method of
Yip and colleagues, and moreover enforces the predictions
to be consistent among levels. As previously mentioned,
the data taken from Yip et al. has some peculiarities worth
discussing. First, it contains a low number of
residue-residue interactions, partially due to design
choices taken by [21]. Second, it is artificially balanced
by including an appropriate number of non-interactions,
while in a real-world case all possible pairs would qualify
as candidates. We decided to keep the dataset as-is in order
to facilitate a fair comparison with Yip et al. We postpone
further analysis with other datasets to future work.

Figure 4 Example prediction. Prediction for the interaction between two ESCRT-II complex subunits: VPS25 (YJR102C) and VPS36 (YLR417W). The
two proteins, their domains, and all the residue pairs in the dataset, are known to interact. Solid black lines indicate a predicted interaction, dotted
lines a non-interaction; residue pairs not connected by either a solid or dotted line are not present in the dataset. (Left, "INDEPENDENT") SBR-∃n predictions with no
constraints: the predictions at the three levels are inconsistent. (Right, "FULL") SBR-∃n predictions with the full set of constraints: the protein- and
domain-level predictions are now both consistent and correct. A further residue pair is now correctly predicted as interacting.

Conclusions 

In this work we tackle the multi-level protein interaction
prediction (MLPIP) problem, first introduced by Yip et al.
[21], which requires establishing the binding state of all
uncharacterized pairs of proteins, domains and residues.
Compared to standard protein-protein interaction
prediction, the MLPIP problem offers many advantages and
opens up new challenges. The primary contribution of this
paper is the extension and application to the MLPIP task
of a state-of-the-art statistical relational learning
technique called Semantic Based Regularization.

SBR is a flexible framework to inject domain knowledge
into kernel machines. In this paper SBR has been used
to tie together the protein, domain and residue interaction
prediction tasks. In particular, the domain knowledge
expresses that two proteins interact if and only if there is
an interaction between at least one pair of domains of the
proteins. Similarly, two domains can interact if and only if
there are at least some residues interacting. While these
tasks could be learned separately, tying them together has
multiple advantages. First, the predictions will be
consistent and more accurate, as the predictions at one
level will help the predictions at the other levels. Second,
the domain knowledge can also be enforced on the
unsupervised data (proteins, domains and residues for which
interactions are unknown). Unsupervised data is typically
abundant in protein interaction prediction tasks but often
neglected. This methodology makes it possible to leverage
it effectively, significantly improving the prediction
accuracy. Note also that, while the resolved complexes are
required during the training stage, no structural
information is required for performing inference on novel
proteins.

While other work in the literature has exploited the 
possibility of tying the predictions at multiple levels, the 
presented methodology employs a more principled infer- 
ence process among the levels, where the domain knowl- 
edge can be exactly represented and precisely enforced. 
The experimental results confirm the theoretical advan- 
tages by showing significant improvements in domain and 
residue interaction prediction accuracy both with respect 
to approaches performing independent predictions and 
the only previous approach that attempts to link the
prediction tasks.

Given the flexibility offered by SBR, the proposed
method can be extended in several ways. The simplest
extension involves engineering a more refined rule set,
for instance by introducing (soft) constraints between the
binding state of consecutive residues, which are likely to
share the same state. More ambitious goals, requiring a
redesign of the experimental dataset, include encoding
selected information sources, such as domain types,
subcellular co-localization and Gene Ontology annotations,
as First Order Logic constraints rather than with kernels,
to better leverage their relational nature.

Methods 

Kernel machines 

Machine learning and statistical methods are very well 
defined for the linear case, and statistical learning the- 
ory can provide optimal solutions in terms of gen- 
eralization performance. Unfortunately, non-linearity is 
often required in order to solve most applications, where 
exploiting complex dependencies is essential to predict 
some higher level property of the objects. Kernel meth- 
ods try to combine the potential classification power of 
non-linear methods and the optimality and computational 
efficiency of linear methods by mapping the input patterns 
into a high dimensional feature space, where parameter 
optimization remains linear. 

Kernel methods have a wide range of applications in 
many fields, and can be used for many different tasks like 
regression, clustering and classification (the latter being 
the main interest of this paper). In particular, the
Representer Theorem [46] shows that a large class of
problems admits solutions in terms of kernel expansions
having the following form:

$$f(x) = \sum_{i=1}^{N} w_i K(x, x_i) \qquad (2)$$

where $x$ is the representation of the pattern and
$K(x, x_i) = \langle \Phi(x), \Phi(x_i) \rangle$ is a kernel
function, with $\Phi(\cdot)$ some mapping from the input
space to the feature space. Intuitively, the kernel function
measures the similarity between pairs of instances, and the
prediction $f(x)$ for a novel instance is computed as a
weighted similarity to the training instances $x_i$. There
is a large body of literature on kernel machines; see e.g.
[26] for an introduction.
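
As an illustration of a kernel expansion of the form f(x) = sum_i w_i K(x, x_i), the sketch below evaluates the prediction for a novel instance with a Gaussian kernel. The kernel choice, the training points and the weights are arbitrary assumptions made for the example; they are not the kernels or parameters used in the experiments.

```python
from math import exp


def gaussian_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2); one common choice of kernel."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def predict(x, train_points, weights, kernel=gaussian_kernel):
    """Kernel expansion: weighted similarity of x to the training instances."""
    return sum(w * kernel(x, xi) for w, xi in zip(weights, train_points))


# Two made-up training instances with opposite weights.
train_points = [(0.0, 0.0), (1.0, 1.0)]
weights = [1.0, -1.0]
print(predict((0.0, 0.0), train_points, weights))  # closer to the first point
print(predict((0.5, 0.5), train_points, weights))  # equidistant from both
```

An instance close to a positively weighted training point receives a positive score, while an instance equidistant from the two points receives a score of zero.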

The optimization of the weights w/ of the Kernel 
machine can be formulated in different ways. Let us indi- 
cate yj e {— 1, +1} the desired output for pattern x^ 
w = [wiy . . .yWyi] is a vector arranging the kernel machine 
parameters and G is the gram matrix, having its (/,/) ele- 
ment defined as Gij = K(XifXj), \ \f\\^ = w^Gw, and it can 
be shown that the following cost function: 

N 

\{f^\\+kJ2^(yjJ(xj)) 

reduces to the formulation of hard margin L2 SVMs [23]
if $L(\cdot)$ is the hinge loss and $\lambda \to \infty$. A very similar cost
function has been employed to solve the protein, domain and
residue interaction prediction problem presented in this paper.
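
As a rough illustration of the cost function above, the sketch below evaluates the regularization term $w^T G w$ plus a weighted hinge loss on a toy Gram matrix. The function names, the toy matrix and the value of `lam` are assumptions chosen for the example.

```python
def hinge(y, fx):
    """Hinge loss: zero once the margin y * f(x) reaches 1."""
    return max(0.0, 1.0 - y * fx)

def kernel_machine_cost(w, G, y, lam):
    """||f||^2 + lam * sum_j hinge(y_j, f(x_j)), with ||f||^2 = w' G w."""
    n = len(w)
    f = [sum(G[j][i] * w[i] for i in range(n)) for j in range(n)]  # f(x_j)
    norm_sq = sum(w[j] * f[j] for j in range(n))                   # w' G w
    loss = sum(hinge(y[j], f[j]) for j in range(n))
    return norm_sq + lam * loss

G = [[1.0, 0.0], [0.0, 1.0]]  # toy Gram matrix (illustrative)
y = [+1, -1]

# No margin violations: only the norm term contributes.
print(kernel_machine_cost([1.0, -1.0], G, y, lam=10.0))
# Smaller weights lower the norm but violate the margin on both points.
print(kernel_machine_cost([0.5, -0.5], G, y, lam=10.0))
```

With a large `lam`, any margin violation dominates the cost, which is the intuition behind the hard margin limit.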

First-order logic 

Propositional logic is based on the basic concept of propo- 
sitions, which can assume either a true or false value. It 
is possible to perform operations on the propositions by 
connecting them via the and ($\land$), or ($\lor$) and not ($\lnot$) operators.
In particular, given two propositions $A$, $B$, it holds
that: $A \land B$ is true iff $A$ = true and $B$ = true; $A \lor B$ is false iff
$A$ = false and $B$ = false; and $\lnot A$ flips the current truth value
of $A$. The operator $\Rightarrow$ can be used to express a conditional
statement: $A \Rightarrow B$ expresses the fact that $B$ must hold true
if $A$ is true. The sentence $A \Rightarrow B$ is false iff $A$ = true
and $B$ = false, and it can be expressed in terms of the other
operators through the equivalence $A \Rightarrow B \equiv \lnot A \lor B$.
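
The truth tables described above, including the rewriting of the implication in terms of negation and disjunction, can be checked exhaustively. The helper name `implies` is introduced here only for illustration.

```python
from itertools import product

def implies(a, b):
    """Material implication: false only when a is true and b is false."""
    return not (a and not b)

# Enumerate all truth assignments and verify A => B == (not A) or B.
for a, b in product([True, False], repeat=2):
    assert implies(a, b) == ((not a) or b)
    print(a, b, "and:", a and b, "or:", a or b, "implies:", implies(a, b))
```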

First-order logic (FOL) extends propositional logic to 
compactly express generic properties for a class of objects, 
thanks to the use of predicates, variables and quantifiers. 
A variable can assume as value any object in some con- 
sidered domain. A variable is said to be grounded once 
it is assigned a specific object. A predicate is a function 
that, taking as input some objects (or grounded variables), 
returns either true or false. Predicates can be connected 
with other predicates using the same operators defined for 
propositional logic. The universal quantifier ($\forall$) expresses
the fact that some proposition is true for any object, while
the existential quantifier ($\exists$) expresses the fact that some
proposition is true for at least one object.

For example, let x be a variable and let Protein(x),
Enzyme(x) and NonEnzyme(x) indicate three predicates
expressing whether, given a grounding x = PDB1a3a,
PDB1a3a is a protein, an enzyme or a non-enzyme, respectively.
The following FOL clause can be used to express
that any protein is either an enzyme or a non-enzyme:
$\forall x \; \text{Protein}(x) \Rightarrow \text{Enzyme}(x) \lor \text{NonEnzyme}(x)$.

Variables and quantifiers can be combined. For example,
given the predicates Protein(x), holding true if x
is a protein, and ResidueOf(x, y), holding true if y is
a residue of x, the following clause expresses the fact that
any protein has at least one residue:
$\forall x \; \text{Protein}(x) \Rightarrow \exists y \; \text{ResidueOf}(x, y)$.
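
Over a finite domain, a quantified clause like the one above can be checked by enumerating groundings. The following sketch uses a made-up domain; the object identifiers and facts are hypothetical and serve only to show how the universal and existential quantifiers map to `all` and `any`.

```python
# Hypothetical toy domain; identifiers and facts are invented for illustration.
objects = ["p1", "p2", "r1", "r2"]
protein = {"p1", "p2"}
residue_of = {("p1", "r1"), ("p2", "r2")}  # (protein, residue) pairs

def Protein(x):
    return x in protein

def ResidueOf(x, y):
    return (x, y) in residue_of

def every_protein_has_residue():
    """forall x: Protein(x) => exists y: ResidueOf(x, y)."""
    return all(
        (not Protein(x)) or any(ResidueOf(x, y) for y in objects)
        for x in objects
    )

print(every_protein_has_residue())

# Removing the only residue of p2 falsifies the clause.
residue_of.discard(("p2", "r2"))
print(every_protein_has_residue())
```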

Semantic-based regularization 

Semantic Based Regularization (SBR) [22] is a general 
framework for injecting prior knowledge expressed in 
FOL into kernel machines for semi-supervised learning 
tasks. The prior knowledge is converted into a set of con- 
tinuous constraints, which are enforced during training. 
The SBR framework is very general and allows the full
expressiveness of FOL to be employed in the definition of the prior
knowledge. The framework also supports
collective classification on the test set, in order to force
the output to respect the logic knowledge.



Let us consider a multitask learning problem, where
each task works on an input domain from which labeled and
unlabeled examples are sampled. For example, in the
case study presented in this paper, three separate tasks
for protein, domain and residue interaction need to be
jointly learned. Each input pattern is described via
some representation that is relevant to solve the tasks at
hand. Let us indicate with $T$ the total number of tasks,
where task $k$ is implemented by a function $f_k$, which lives
in an appropriate Reproducing Kernel Hilbert Space $\mathcal{H}_k$. In
the following, $f = [f_1, \ldots, f_T]'$ indicates the vector
collecting all task functions. The basic assumption of SBR
is that the task functions are correlated, as they have to
meet a set of constraints that can be expressed by the
functionals $\phi_h : \mathcal{H}_1 \times \ldots \times \mathcal{H}_T \to [0, +\infty)$ such that
$\phi_h(f) = 0$, $h = 1, \ldots, H$, must hold for any valid choice
of $f_k \in \mathcal{H}_k$, $k = 1, \ldots, T$. Following the classical penalty
approach for constrained optimization, the constraints are
embedded by adding a term that penalizes their violation:

$$\lambda_r \sum_{k=1}^{T} \|f_k\|^2 + \sum_{k=1}^{T} L(f_k, y_k) + \lambda_c \sum_{h=1}^{H} \phi_h(S, f), \qquad (3)$$

where $L(\cdot)$ is a loss function measuring the distance of
the function output from the desired one and $S$ is the
considered sample of data points over which the functions
are evaluated. In the experimental setting, $L(\cdot)$ has
been set to be the hinge loss. It is possible to extend
the Representer Theorem to show that the best solution
for Equation 3 can be expressed as a kernel expansion as
shown in Equation 2 [22].

Therefore, using kernel expansions, Equation 3
becomes:

$$\lambda_r \sum_{k=1}^{T} w_k^T G_k w_k + \sum_{k=1}^{T} L(G_k w_k, y_k) + \lambda_c \sum_{h=1}^{H} \phi_h(G_1 w_1, \ldots, G_T w_T),$$

where $G_k$, $w_k$, $f_k = G_k w_k$ and $y_k$ are the Gram matrix,
the weights, the function values over the data sample and
the desired output column vector for the patterns in the
domain of the $k$-th task. Evidence tasks do not need to be
approximated, as they are fully known.

Optimization of the parameters for the kernel-expanded cost
function above can be done using gradient descent. The
constraints $\phi_h$ are non-linear in most interesting cases, like the
one presented in this paper. Therefore, the cost function
can present multiple local minima, making optimization
difficult. SBR uses a two-step heuristic to address this problem:
first, it computes the global optimum
for each predicate independently (setting $\lambda_c = 0$), since without
constraints each task is a convex kernel machine. Then, it introduces the
constraints and proceeds to find a good solution using
gradient descent.
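
The two-step heuristic can be sketched on a toy problem with two predicates A and B sharing two data points. Everything below is an assumption made for illustration and not the SBR implementation: gradients are approximated by finite differences, steps are accepted only if they lower the objective, outputs are mapped to $[0, 1]$ by a clipped rescaling, and the placeholder penalty $a(x)(1 - b(x))$ stands in for a constraint of the form A implies B.

```python
def clip01(v):
    return min(1.0, max(0.0, v))

def matvec(G, w):
    return [sum(g * wi for g, wi in zip(row, w)) for row in G]

def cost(ws, Gs, ys, lam_r, lam_c, constraint):
    """Regularization + hinge loss on labeled points + constraint penalty."""
    fs = [matvec(G, w) for G, w in zip(Gs, ws)]
    reg = sum(sum(wi * fi for wi, fi in zip(w, f)) for w, f in zip(ws, fs))
    loss = sum(max(0.0, 1.0 - y * fi)
               for f, ylab in zip(fs, ys) for fi, y in zip(f, ylab) if y != 0)
    return lam_r * reg + loss + lam_c * constraint(fs)

def descend(ws, objective, steps=200, lr=0.05, eps=1e-4):
    """Finite-difference gradient descent that only accepts improving steps."""
    best = objective(ws)
    for _ in range(steps):
        grad = []
        for k, w in enumerate(ws):
            gk = []
            for i in range(len(w)):
                shifted = [list(v) for v in ws]
                shifted[k][i] += eps
                gk.append((objective(shifted) - best) / eps)
            grad.append(gk)
        cand = [[wi - lr * gi for wi, gi in zip(w, g)] for w, g in zip(ws, grad)]
        value = objective(cand)
        if value < best:
            ws, best = cand, value
        else:
            lr *= 0.5  # shrink the step when no improvement is found
    return ws

def violation(fs):
    """Placeholder penalty for 'A(x) implies B(x)': sum_x a(x) * (1 - b(x))."""
    a = [clip01((v + 1) / 2) for v in fs[0]]
    b = [clip01((v + 1) / 2) for v in fs[1]]
    return sum(ai * (1 - bi) for ai, bi in zip(a, b))

G = [[1.0, 0.5], [0.5, 1.0]]
Gs = [G, G]
ys = [[+1, +1], [+1, 0]]  # 0 marks an unlabeled point
start = [[0.0, 0.0], [0.0, 0.0]]

# Step 1: fit each predicate with the constraints switched off (lam_c = 0).
step1 = descend(start, lambda ws: cost(ws, Gs, ys, 0.1, 0.0, violation))
# Step 2: switch the constraints on and continue from the step-1 solution.
step2 = descend(step1, lambda ws: cost(ws, Gs, ys, 0.1, 1.0, violation))

print(cost(step1, Gs, ys, 0.1, 1.0, violation), cost(step2, Gs, ys, 0.1, 1.0, violation))
```

Starting step 2 from the unconstrained solution gives gradient descent a sensible initial point, which is the purpose of the heuristic.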

In the following, we show how to express first-order logic
clauses in terms of the constraints $\phi_h$.

Translation of first-order logic clauses into real-valued 
constraints 

With no loss of generality, we restrict our attention to FOL
clauses in prenex normal form (PNF), where all the quantifiers
($\forall$, $\exists$) and their associated quantified variables are placed at the
beginning of the clause. For example:



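$$\underbrace{\forall v_1 \forall v_2}_{\text{Quantifiers}} \; \underbrace{A(v_1) \land B(v_2) \Rightarrow C(v_1)}_{\text{Quantifier-free expression}} \qquad (4)$$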



Please note that the quantifier-free part of the expres- 
sion is equivalent to an assertion in propositional logic for 
any given grounding of the quantified variables. As stud- 
ied in the context of fuzzy logic and symbolic AI, different 
methods can be used for the conversion of a propositional 
expression into a continuous function with [0, 1] input 
variables. 

T-norms 

T-norms [47] are commonly used in fuzzy logic [48] to
generalize propositional logic expressions to real-valued
functions of continuous variables. A continuous t-norm is
a function $t : [0, 1] \times [0, 1] \to \mathbb{R}$ that is continuous,
commutative, associative and monotonic, and that has the neutral
element 1 (i.e. $t(a, 1) = a$). A t-norm fuzzy logic is defined
by its t-norm $t(a_1, a_2)$, which models the logical AND, while
the negation of a variable $\lnot a$ is computed as $1 - a$. Once
the functions corresponding to the logical AND and NOT
are defined, they can be composed to
convert any arbitrary logic proposition into a continuous
function. Many different t-norm logics have been proposed
in the literature. For example, the product t-norm
used in the experimental section defines:



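$$\begin{aligned}
(a_1 \land a_2) &\;\xrightarrow{\text{mapped}}\; t(a_1, a_2) = a_1 \cdot a_2 \\
(\lnot a_1) &\;\xrightarrow{\text{mapped}}\; t(a_1) = 1 - a_1 \\
(a_1 \lor a_2) &\;\xrightarrow{\text{mapped}}\; t(a_1, a_2) = a_1 + a_2 - a_1 \cdot a_2
\end{aligned}$$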



Please note that the t-norm behaves as classical logic when 
the variable approaches the value 0 (false) or 1 (true). 
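
The product t-norm connectives can be written directly as small functions. The names `t_and`, `t_not` and `t_or` are chosen here for illustration; the check on crisp inputs mirrors the remark that the t-norm agrees with classical logic at 0 and 1.

```python
def t_and(a1, a2):
    """Product t-norm for AND."""
    return a1 * a2

def t_not(a1):
    """Negation."""
    return 1.0 - a1

def t_or(a1, a2):
    """Disjunction paired with the product t-norm."""
    return a1 + a2 - a1 * a2

# On crisp inputs the connectives reproduce the classical truth tables.
for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        assert t_and(a, b) == float(bool(a) and bool(b))
        assert t_or(a, b) == float(bool(a) or bool(b))

# The OR mapping follows from AND and NOT via De Morgan's law;
# the two printed values agree up to floating-point rounding.
print(t_or(0.3, 0.6), t_not(t_and(t_not(0.3), t_not(0.6))))
```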

The equivalence $a_1 \Rightarrow a_2 \equiv \lnot a_1 \lor a_2$ can be used
to represent implications (modus ponens) before performing
t-norm conversion. However, this process does not
adequately capture the inference process performed in
a probabilistic or fuzzy logic context. Any t-norm has a
corresponding binary operator $\Rightarrow$ called residuum, which
is used in fuzzy logic to generalize implications in case
of continuous variables. In particular, for the minimum
t-norm, the residuum converting an implication
is defined as:



$$(a_1 \Rightarrow a_2) \;\xrightarrow{\text{mapped}}\; t(a_1, a_2) = \begin{cases} 1 & a_1 \le a_2 \\ a_2 & a_1 > a_2 \end{cases} \qquad (5)$$

The residuum relaxes the condition of satisfaction
for the implication: an implication is satisfied if
the right-hand side of the implication is at least as true
as the pre-condition on the left side. This makes the
fuzzy or probabilistic inference process easier and better
defined. While the original SBR formulation represents
implications using modus ponens, the minimum t-norm
residuum has been used in the experimental section of this
paper to convert implications.
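
To make the difference between the two implication conversions concrete, the sketch below compares the minimum t-norm residuum of Equation 5 with the conversion obtained by rewriting the implication as a disjunction under the product t-norm. The function names and input values are illustrative assumptions.

```python
def residuum_min(a1, a2):
    """Residuum of the minimum t-norm: 1 if a1 <= a2, else a2."""
    return 1.0 if a1 <= a2 else a2

def implication_via_or(a1, a2):
    """NOT a1 OR a2 under the product t-norm: 1 - a1 + a1 * a2."""
    return 1.0 - a1 + a1 * a2

# Consequent more true than the premise: the residuum is fully satisfied,
# while the disjunction-based conversion still reports a small violation.
print(residuum_min(0.2, 0.7), implication_via_or(0.2, 0.7))

# Consequent less true than the premise: the residuum returns the consequent.
print(residuum_min(0.9, 0.4), implication_via_or(0.9, 0.4))
```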

For example, the quantifier-free expression in
Equation 4 corresponds to:

$$\begin{cases} 1 & f_A(x_1) \cdot f_B(x_2) \le f_C(x_1) \\ f_C(x_1) & \text{else} \end{cases}$$

where the predicates have been substituted by the
corresponding unknown functions and $x_1$, $x_2$ are the
representations of the objects identified by the grounded
variables $v_1$, $v_2$, respectively. The representation of an
object must be compatible with what is accepted as input
by the kernels used by the predicate approximations such as $f_A$ and $f_B$.
For example, it can be a vector of real-valued features
when using a linear or Gaussian kernel, or a graph or tree
representation when using kernels for structures. It is also
possible to use the methodology when no explicit
representations are known, but only the kernel values for each
pair of input objects.

Quantifiers are also converted into real-valued operators.
The universal quantifier corresponds to the sum of the
degrees of violation of the continuous expression coming
from t-norms over all possible groundings for the quantified
variable. Let us consider a universally quantified FOL
formula $\forall v \; E(v, \mathcal{P})$. When considering the real-valued
mapping $t_E(f, x)$ of the original boolean expression, where
$x$ is the representation of $v$, the universal quantifier can be
converted by measuring the degree of non-satisfaction of the
expression over the domain $S$ of $x$:



$$\forall v \; E(v, \mathcal{P}) \;\xrightarrow{\text{mapped}}\; \phi(f, S) = \sum_{x \in S} 1 - t_E(f, x)$$



For example, the formula $\forall v \; A(v) \land B(v)$ corresponds to:

$$\phi(f, S) = \sum_{x \in S} 1 - f_A(x) f_B(x)$$

where $f_A(x) f_B(x)$ is the t-norm generalization of the
propositional expression $A(v) \land B(v)$ for a given grounding
of $v$.
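
The conversion of the universal quantifier amounts to summing the violations over all groundings. A minimal sketch follows, in which the predicate outputs `f_A` and `f_B` are hypothetical values in $[0, 1]$ and the helper name `forall_violation` is introduced for illustration.

```python
def forall_violation(truth_degree, sample):
    """phi(f, S) = sum over x in S of 1 - t_E(f, x)."""
    return sum(1.0 - truth_degree(x) for x in sample)

# Hypothetical predicate outputs for three groundings.
f_A = {"x1": 1.0, "x2": 0.8, "x3": 0.5}
f_B = {"x1": 1.0, "x2": 0.5, "x3": 0.0}

# forall v: A(v) AND B(v), with AND mapped to the product t-norm.
phi = forall_violation(lambda x: f_A[x] * f_B[x], f_A.keys())
print(phi)  # contributions are roughly 0, 0.6 and 1.0
```

The penalty is zero only when the expression is fully satisfied for every grounding.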

When multiple universally quantified variables are
present, the conversion is performed recursively
from the outer to the inner variable. In particular,
$\forall v_1 \ldots \forall v_n \; E(v_1, \ldots, v_n, \mathcal{P})$ is mapped to the constraint:

$$\phi(f, S) = \sum_{x_1 \in S_1} \cdots \sum_{x_n \in S_n} 1 - t_E(f, x_1, \ldots, x_n)$$

For example, the formula $\forall v_1 \forall v_2 \; A(v_1) \land \lnot B(v_2)$ corresponds to:

$$\phi(f, S) = \sum_{x_1 \in S_1} \sum_{x_2 \in S_2} 1 - f_A(x_1)(1 - f_B(x_2))$$

The existential quantifier is mapped into the continuous
domain as:

$$\exists v \; E(v, \mathcal{P}) \;\xrightarrow{\text{mapped}}\; \phi(f, S) = \min_{x \in S} \; 1 - t_E(f, x)$$
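
For the existential quantifier, the penalty is the smallest violation over the groundings, so a single well-satisfying object is enough. The sketch below uses hypothetical predicate outputs; `exists_violation` is an illustrative name.

```python
def exists_violation(truth_degree, sample):
    """phi(f, S) = min over x in S of 1 - t_E(f, x)."""
    return min(1.0 - truth_degree(x) for x in sample)

# Hypothetical outputs of a predicate A for three groundings.
f_A = {"x1": 0.2, "x2": 0.9, "x3": 0.4}

# exists v: A(v) is almost satisfied thanks to x2.
print(exists_violation(lambda x: f_A[x], f_A.keys()))
```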



This framework also allows a natural definition of the $\exists_n$
operator, generalizing the existential operator to $n$ objects.
This operator is usually defined in description logic, while
it can only be indirectly defined in FOL. This operator will
be used in the experimental section and its continuous
mapping is defined as:



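$$\exists_n v \; E(v, \mathcal{P}) \;\xrightarrow{\text{mapped}}\; \phi(f, S) = \sum_{x \in \operatorname{argmax}_n(S)} 1 - t_E(f, x)$$

where $\operatorname{argmax}_n(S)$ indicates the $n$ assignments of $x$
in $S$ that best satisfy the expression, i.e. those with the largest
$t_E(\cdot)$ and hence the smallest violation $1 - t_E(\cdot)$. The
conversion of the $\exists_n$ operator consistently reduces to the
$\forall$ conversion when $n = |S|$ and to the conversion of the $\exists$
operator when $n = 1$.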
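
The stated reductions of the n-existential operator can be checked numerically. The sketch below assumes that the selected groundings are the $n$ with the smallest violation, which is what the reductions to the existential case ($n = 1$) and the universal case ($n$ equal to the sample size) require; the function name and the data are illustrative.

```python
def exists_n_violation(truth_degree, sample, n):
    """Sum of 1 - t_E over the n groundings that best satisfy E."""
    violations = sorted(1.0 - truth_degree(x) for x in sample)
    return sum(violations[:n])

# Hypothetical truth degrees of the expression for four groundings.
t_E = {"x1": 0.9, "x2": 0.1, "x3": 0.6, "x4": 0.8}
S = list(t_E)
degree = lambda x: t_E[x]

print(exists_n_violation(degree, S, 1))       # same as the existential penalty
print(exists_n_violation(degree, S, 2))       # the two best groundings
print(exists_n_violation(degree, S, len(S)))  # same as the universal penalty
```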

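As a final example of the conversion procedure, let us
refer to the FOL clause in Equation 4, which is converted
into the real-valued constraint $\phi(f, S)$:

$$\phi(f, S) = \sum_{x_1 \in S_1} \sum_{x_2 \in S_2} \begin{cases} 0 & f_A(x_1) \cdot f_B(x_2) \le f_C(x_1) \\ 1 - f_C(x_1) & \text{else} \end{cases}$$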
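
The complete conversion can be sketched end to end for a clause of the form A(v1) AND B(v2) implies C(v1). The sketch assumes that both variables are universally quantified, that the conjunction uses the product t-norm and that the implication uses the minimum t-norm residuum; the predicate outputs are hypothetical.

```python
def clause_violation(f_A, f_B, f_C, S1, S2):
    """Penalty for: forall v1 forall v2  A(v1) AND B(v2) => C(v1)."""
    total = 0.0
    for x1 in S1:
        for x2 in S2:
            antecedent = f_A(x1) * f_B(x2)  # product t-norm for AND
            consequent = f_C(x1)
            # Minimum t-norm residuum for the implication.
            truth = 1.0 if antecedent <= consequent else consequent
            total += 1.0 - truth
    return total

# Hypothetical predicate outputs indexed by grounding.
A = [0.9, 0.2]
B = [1.0, 0.5]
C = [0.3, 0.6]

phi = clause_violation(lambda i: A[i], lambda j: B[j], lambda i: C[i],
                       S1=range(2), S2=range(2))
print(phi)  # only the first grounding of v1 contributes, twice 0.7
```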